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Abstract 



We construct a pseudospectral method for the solution of time-dependent, non- linear partial differ- 
ential equations on a three-dimensional spherical shell. The problem we address is the treatment 
of tensor fields on the sphere. As a test case we consider the evolution of a single black hole in nu- 
merical general relativity. A natural strategy would be the expansion in tensor spherical harmonics 
in spherical coordinates. Instead, we consider the simpler and potentially more efficient possibility 
of a double Fourier expansion on the sphere for tensors in Cartesian coordinates. As usual for the 
double Fourier method, we employ a filter to address time-step limitations and certain stability 
issues. We find that a tensor filter based on spin-weighted spherical harmonics is successful, while 
two simplified, non-spin-weighted filters do not lead to stable evolutions. The derivatives and the 
filter are implemented by matrix multiplication for efficiency. A key technical point is the construc- 
tion of a matrix multiplication method for the spin-weighted spherical harmonic filter. As example 
for the efficient parallelization of the double Fourier, spin-weighted filter method we discuss an 
implementation on a GPU, which achieves a speed-up of up to a factor of 20 compared to a single 
core CPU implementation. 

Keywords: pseudospectral, double Fourier, spin-weighted spherical harmonics, GPU computing, 
numerical relativity 



1. Introduction 

Spectral methods are applicable to a wide range of partial differential equations, e.g. [1]. We 
consider the case of time-dependent tensor fields on a three-dimensional spherical shell. The field 
equations are assumed to be non-linear without giving rise to shocks, hence we choose pseudospec- 
tral collocation methods. Non-linearity and time-dependence may necessitate the use of filters (or 
some alternative) to stabilize the method [TJ [2]. Furthermore, as typical for spectral methods, 
the domain influences the choice of basis functions, which in turn matters for the computation of 
derivatives and for the construction of filters. 

The specific application considered in this work is a test case for numerical general relativity, 
a single Schwarzschild black hole. This is a vacuum solution of the Einstein field equations, which 
in adapted coordinates is spherically symmetric and static. However, when implemented on a 3d 
grid with the full evolution equations, some non-trivial time evolution including deviations from 
sphericity can occur. In particular, unstable modes leading to a failure of the evolution after a finite 
time can and do appear if the problem is not formulated with due care, which makes this example 
a valuable test case, e.g. [3j. Here we study the formulation given in [3], which is a first order in 
time and space reformulation of the Einstein equations in the generalized harmonic gauge (GHG). 
The GHG system including modifications for stability is an example for the class of problems that 
can be written in the form 



where u^(t,x l ) is the vector of variables, <9j = d/dx l , and a summation over up/down indices is 
assumed (i = 1,2,3). The coefficient matrices A t>J, u and may depend on but not on its 
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derivatives. For the GHG equations, greek indices label the 50 fields (p = 1, . . . , 50) representing 
specific tensor components of the field degrees of freedom. Depending on how the gauge is treated, 
this number increases to 54 or 58. Specifically, the GHG system involves rank 1, 2, and 3 tensors. 
We collect the relevant details in App. A. Spectral methods in numerical relativity are reviewed 
e.g. in [5l [6]. The computational method discussed in this work depends in part on the form of 
([!]), on that the fields are tensor components, and on the choice of a spherical shell as the domain. 
Other details of the physics should only be of secondary importance and not affect the generality 
of the discussion. 

The specific domain under consideration, spherical shells, influences the choice of basis func- 
tions. We choose a Chebyshev basis for the radial direction. For the two angular directions, the 
standard choice for scalar fields is spherical harmonics, leading to a "CY"-basis on the 3d shell. 
For tensor fields, one possibility is to employ spin-weighted spherical harmonics on the sphere, i.e. 
a "CYn" -basis, where "Yn" indicates that spin- weighted spherical harmonics are used. A general 
rank n tensor (a tensor with n indices) can be decomposed in a linear combination of spin-weight 
0, ±l,...,±n spherical harmonics. For recent work on tensor expansions with a connection to 
relativity, see [7]. 

However, especially for tensor fields, other choices of basis are possible and sometimes even 
advantageous. In this work we explore the suitability of a CFF basis, where "FF" stands for a 
double Fourier basis on the sphere [U HOj [H] . The double Fourier method includes a filter to 
address the clustering of points near the poles. The basic choice is between the "ideal" filter of 
spherical harmonic projection and simpler, less costly methods. A Y-filter is the projection on 
a finite number of spherical harmonics consisting of a forward and backward spherical harmonic 
transform. The CFF basis with a Y-filter can be equivalent to the CY method |10| [T2"] . For tensor 
fields, in some cases a CY-basis with a Yn-filter is a possible solution, see [1] for the black hole 
example. For a different formulation of the black hole problem, a CFF method with a Y-filter has 
been considered in [13|. I14j. although with evolutions that are not as stable as in [1]. 

The goal and result of the present paper is a CFF method with a Yn-filter for tensor fields 
on a shell. Our method results in long-term stable evolutions for the single black hole example 
comparable to [1], although there remains some slow, residual linear growth that we do not study 
further in this work. 

Part of the rationale behind the CFF basis [12] is that computing partial derivatives is simpler 
and usually more efficient than for a CYn or CY basis. While there exist fast Legendre transforms 
to implement the spherical harmonic derivatives, they involve a higher overhead than fast Fourier 
transforms, in particular for small transform sizes. However, although the CFF method avoids 
spherical harmonics in the derivatives, we choose to apply spherical harmonics as a filter. Since the 
FF basis on the sphere does not have uniform areal resolution, some type of spherical harmonic 
filter can be essential to alleviate the severe time-step restrictions due to the clustering of points 
near the poles in the FF basis. Comparing a CY method to a CFF with Y-filter method, the latter 
can be more efficient since the spherical harmonic transform is only used on the fields, while for 
the CY method a larger number of spherical harmonic transforms is required for the derivatives of 
the fields. Also, it can be easier to optimize a Y-filter, or to find alternatives to Y-filters, rather 
than to optimize spherical harmonic transforms per se. 

For the particular treatment of tensor fields that we consider, a tensor Yn-filter plays one 
further role, in addition to projecting onto a uniform area basis and to filtering for stability of the 
non-linear field evolution. To avoid coordinate singularities, it is convenient to express the tensor 
components with respect to global Cartesian coordinates, (x,y,z), while the collocation grid is 
based on spherical coordinates, (r,9,cj)). For example, the Cartesian components of a smooth 
vector field are smooth at the poles of the spherical grid, implying spectral convergence in a CFF 
or CY basis, where each component of the vector is expanded as if it were a scalar field. While 
spectrally convergent, the Cartesian components represent a mixture of different spin- weights that 
is not properly handled by the scalar Y-filter. In particular, the CY method of [3] displays a long- 
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term instability linked to the combination of the Cartesian components with the Y-filter. This 
instability was noted in a related context |15j and cured by a tensor spherical harmonic filter in 
the examples of [1 5|. 13], although details of the instability or the implementation were not given. 

The main topic of the present work is the double Fourier method combined with a spin- weighted 
spherical harmonic filter for tensor fields. Since we may need a Yn- filter for stability anyway, this 
paper examines the question whether we can do away with the complications of Y-derivatives and 
Yn-derivatives completely. Can we take two shortcuts (the FF basis and Cartesian components) 
and clean up with one trick (the Yn-filter) later? In the example considered, the answer is yes, and 
the method realizes the efficiency and simplicity bonus of the CFF method with Y-filter for scalar 
fields. To our knowledge, while there is literature on both the CFF method and the construction 
of Yn- filters, there is no description yet of a CFF method combined with a Yn-filter for \n\ > 1. 

An important aspect of the proposed CFF/Yn- filter method is its efficient implementation. 
Since we consider a collocation method in 3d, say with N s points, we cannot handle very large iV 
anyway. With regard to computing Id derivatives on a 3d grid, our task is a small N problem, say 
N < 100, in contrast to 2d or Id problems with much larger N. Also, in our example exponential 
convergence of the solution usually means that double precision round-off error is reached for 
N ~ 40, since there are no features on a smaller scale to be resolved. If there are local features 
to be resolved (in the black hole example, waves of small wavelength travelling to infinity), the 
recommended strategy for efficiency is not to use large iVona single domain, but rather to take a 
step towards "spectral elements" and to decompose the domain into several nested spherical shells. 
Therefore, with domain decomposition in mind for efficient 3d methods, one relevant test case to 
consider is that of a single domain where the number of points in each direction is comparatively 
small, with N < 100. 

Given a small N problem, we are led to consider matrix methods for the computation of 
derivatives and filters [16\ I17j . The operation count of a typical implementation of the partial 
differential equation is dominated by the computation of the spectral derivatives. For the 
Chebyshev and Fourier bases, we compute derivatives using Fourier transforms (FTs), where the 
standard choice for an efficient algorithm is the fast Fourier transform (FFT). However, it is also 
well-known that for sufficiently small N a FT by direct matrix multiplication can be faster than a 
FFT, since it avoids a certain overhead, e.g. [16} IT]. Furthermore, fast methods for the Legendre 
transform that is part of spherical harmonic filtering are not yet competitive with other methods 
for iV < 300 [18j, so matrix multiplication is often used by default. Note also that the computation 
of a derivative or a filter using two FTs can be combined into a single matrix multiplication. In 
the example we consider here, an implementation of the FT via matrix multiplication is found to 
be competitive or even faster than FFTs for about N < 100, see Sec. 4.2 Therefore, for the rather 
small N that we want to consider, we focus on the matrix multiplication method. 

This leads to the second topic of the paper. Since the proposed CFF method requires a Yn-filter 
as an essential part for stability, we have to address the implementation and efficiency of Yn-filters. 
We will show how Yn-filters can be implemented by a matrix multiplication method. That this is 
possible, is clear (a Yn-filter is a linear transformation of a finite number of grid values), but we 
give a prescription that is well-adapted to the present case. Even though various software libraries 
for Y-transforms are in principle available for various platforms, this is in general not true for 
Yn-transforms, so a simple prescription in terms of matrices should be of value. As a consequence 
of the time-dependence of our problem, all the required matrices for differentiation and filtering 
can be precomputed at negligible startup cost, and in our case (the Einstein equations with at 
least 50 variables) also at low memory cost. 

As a third and final topic, we address the parallelisation of the CFF/Yn-filter method on graph- 
ics cards (GPU computing). Concretely, we discuss an implementation using NVIDIA's CUDA 
framework [19] . A key issue to address is that in order to avoid the bottleneck of host-device mem- 
ory transfers, it is optimal to implement the entire calculation apart from input/output operations 
on a single graphics card. Although BLAS is available in CUDA, several non-BLAS operations 
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are required. GPU computing gives us an additional reason for a matrix method, since on new 
architectures, basic linear algebra can be expected to arrive earlier and to be better optimized than 
FFTs (as was the case for CUDA during the last years). We present some performance results 
for the CFF/Yn-filter method for the single black hole test case. The non-standard feature with 
regard to matrix computations on graphics cards is that the matrix computations involve the mul- 
tiplication of small-by-small matrices with small-by-large matrices, say a 40 x 40 times a 40 x 40000 
matrix. That is, the product of small, square differentiation and filter matrices with rectangular 
matrices representing the fields with one small and one much larger dimension. Optimization for 
such matrices was found to be less advanced than the standard square matrix case using dgemm 
for N > 1024. The required small/rectangular matrix products achieve about 50 — 100 Gflop/s, 
compared to 300 Gflop/s for large matrices and a theoretical peak around 500 Gflop/s on the avail- 
able NVIDIA hardware. The bottom line for the GPU implementation of the black hole example 
is a speed-up of a factor of about 10 — 20 compared to a single CPU implementation. 

The paper is organized as follows. In Sec. [2j we describe the CFF collation method for a 
spherical shell, in particular the computation of the pseudospectral derivatives. In Sec. [3j we 
discuss the discrete transforms for the Fourier, spherical harmonic, and spin-weighted spherical 
harmonic bases, and construct the corresponding filters. In Sec. [4j we discuss various numerical 
features of the single black hole test case and present some benchmarks. We conclude in Sec. [5] In 
App. A, we summarize the formulation of the black hole example, and App. B gives examples for 
spin-weighted spherical harmonics. 



2. Chebyshev- Fourier- Fourier collocation method 

2.1. Coordinates and collocation grid for a spherical shell 

Consider a spherical shell in three dimensions given in standard spherical coordinates by r S 

[rmin,r ma x], 9 £ [0, 7r], and <f> G [0, 2tt]. We introduce a discrete (Cartesian-product) grid on the 
shell by 

_ fmax T m in T max Train 7r/u , „ ,t -. /n\ 

r k — ^ 2 cos jy _ -y > /C — U, . . . , iV r — 1, (Z) 



i = 0,...,N e -l, (3) 
j = 0,...,AV-l. (4) 



The radial grid is adapted to a Chebyshev spectral basis. There are N r points in the radial 
direction located at the Chebyshev extrema points plus the end points of the interval [r m , n , r max ] . 
In latitude, there are Ng equally spaced points that stagger the poles at half a grid spacing. In 
longitude, there are equally spaced points. 

The collection of fields u^(t, r, 9, 4>) on the sphere that defines the state vector of the physical 
problem is represented by the spatially discrete values u^-it) = u^(t,rk,9i,(pj) at the collocation 
points. 

The collocation points in the angular direction are appropriate both for spherical harmonics, 
which we use for filters, and for the double Fourier spectral basis, which we use for the computation 
of derivatives. The double Fourier approach relies on periodicity in both angular coordinates. 
This can be made explicit by a double covering of the sphere, i.e. by doubling the range of 9 by 
chosing i = 0, . . . , 2Ng — 1 instead of i = 0, . . . , Ng — 1 while keeping the grid spacing ir/Ng fixed. 
Equivalently, we can use the identity (9, (ft) = (2ir — 9, ir + </>) between points on the sphere, which 
implies f(0,(j>) = f(2ir — 9, ir + 4>) for any function / on the sphere. The fields have to be stored 
only for the single cover, 9 6 [0,tt]. Only when the derivatives in the ^-direction are computed, 
we temporarily introduce data for 9 E [ir, 2ir] by symmetry for convenience, so that the Fourier 
derivative can be computed by the matrix multiplication discussed below. Concretely, Ng data 



points are expanded to 2Ng points, and for the matrix multiplication we use half of the standard 
matrix (a Ng x 2Ng matrix), since the result is only needed for the single cover. 

For the CFF grid, we choose an even number of points in the (^-direction, so that both <j)j 
and cj)j + 7r are part of the grid. For the spherical harmonic transform required for the filter, equal 
angular resolution is appropriate, so we set 

= 2N e . (5) 

This is also the natural choice for a physics problem that requires roughly equal angular resolution 
in 9 and (p. Taking into account the staggering in 9, we choose Ng odd so that there are points in 
the x-y-plane. In this case, Ng = 2k + 1 and Nj, = 4fe + 2 for k an integer. 

To illustrate that this is of course not the only way to define a double Fourier grid, in |14j the 
0-range is 9 £ [0, 2vr], 0, = n(2i + l)/N e for % = 0, . . . , Ng - 1, and furthermore N^ = 3N e /A with 
Ng a multiple of 4. The filter of [H] removes approximately half the modes in the (double covered) 
^-direction and one-third of the modes in the ^-direction. In the present work, similar to [4], such 
one-half or one-third rules are not used (and apparently not crucial for stability), and hence our 
grid dimensions are not adapted to such filtering. 



2.2. Cartesian tensors and smoothness 

Since we consider not just scalar but tensor fields, we have to discuss the smoothness of the 
tensor components in different coordinate systems. Given a tensor field with smooth components 
in Cartesian coordinates (x,y,z), in general its components with respect to spherical coordinates 
(r, 9, (j>) are not smooth on the z-axis. Spherical coordinates introduce a non-physical coordinate 
singularity through the Jacobian of the coordinate transformation. One possibility is to consider 
an appropriate (non-smooth) spectral basis for spherical coordinates, for example, tensor spherical 
harmonics. A simple alternative is to avoid the coordinate singularities by computing with Carte- 
sian tensor components on the spherical coordinate grid (which is not uncommon in numerical 
relativity, e.g. [U [T3l 120] ). Introducing a global Cartesian coordinate system also simplifies the 
treatment of varying coordinates in multiple grid domains. 

For example, a vector v l = [v x , v y , v z ](x, y, z) in Cartesian coordinates can be evaluated at 
the grid points of the spherical coordinate grid, x^ij = x(rk,9i,<pj) etc. As part of the spectral 
method, partial derivatives are computed along coordinate lines of spherical coordinates, that is, 
the spectral derivative operators compute d r , dg, and d^. However, for the field equations the 
result has to be expressed in Cartesian components, which is done using the chain rule. For the 
example of a vector, 

where x % = (x, y, z) and x % = (r, 9, (p). The Jacobian matrix is known analytically, with a pole 
e.g. in |^ = — gf^fl at 9 = 0, it even if r > for the shell. However, the Cartesian components 
v k (x) are constant as functions of eft as 9 — > 0,7r, hence -§^v k {x) vanishes at the poles, and the 
overall result is finite. In the numerical computation, it turns out that staggering points in the 
^-direction so that 9 = 0,ir is not part of the grid suffices for an exponentially convergent result. 
Although ^ is within half a grid-spacing of a pole, it is finite, and the spectral accuracy of the 
numerical derivatives is sufficient for the convergence of ([6]). 

If we stored v^{x) = |^V(x), then there would be the additional issue that the pole has to 

be differentiated numerically. To avoid this, we could store dual vectors, Wj(x) = ^jWi(x), where 
the inverse Jacobian is finite. However, the inverse Jacobian is multi-valued (not continuous) at 
the poles of the sphere, e.g. §f(# = 0) = r cos <j>. For the Y-basis this is an issue, since spectral 
convergence of the expansion is lost, while the Yn-basis addresses precisely this issue. The FF- 
basis does not have an immediate problem, since §| = r cos (f) cos 9 is fine as a periodic function 
for (9,(ft) E [0, 2ir] x [0,27rj. We did not explore whether the FF-basis with tensor components in 
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spherical coordinates can lead to spectral convergence for the tensor equations at hand, but rely 
on Cartesian components and the chain rule for differentiation (pi). 



2.3. Computation of derivatives in Id 

For the CFF basis, the computation of derivatives reduces to three one-dimensional derivatives 
in each of the three directions. (For CY, the spherical harmonic part is not a Id operation.) We 
compute derivatives with the matrix multiplication method, e.g. |16| 117]. For a function f(x) on a 
Id grid with N points Xi, the function values fi = f(xi) are multiplied by a N x N differentiation 
matrix Dij to obtain the approximate derivative, 

N-i 

(d x f) i =J2 D afr (7) 

3=0 

For the angular directions we assume that 2Ng and Na, are even and that the points are equally 
spaced on a periodic grid, see ^ and Q for the double cover. For N even, the Fourier differenti- 
ation matrix is 

™H = J~X' fori/j, FD d = 0. (8) 
2tan(- L g - J -) 

The Chebyshev differentiation matrix for the extrema grid x G [—1,1], Xi = — cos-^j, i = 
0,...,N- 1, is 

CDij = ^ K -^— for i^j, CD U = - V CDij, (9) 



3 ^ 3 



3=0,3^ 



where c& = 2 if k = or k = N — 1, and = 1 if < k < N — 1. The explicit value on the 
diagonal is known, but the sum in ([9]) is preferable for stability. For the radial direction, we assume 
the Chebyshev extrema grid ([2]), so the differentiation matrix has to be rescaled according to the 
linear transformation between r G ^min^max 

] and x G [—1, 1], CD^ = 2CD ij /{ 

r max ~ Train)- For 

additional details of the computation of differentiation matrices see |16 | I1T|, [2Tj . 

We compute and store the Id differentiation matrices of the CFF basis once during the initial- 
ization of the time evolution. 

2-4- Computation of derivatives in 3d 

For three-dimensional grids there are various options for the storage layout of the data and for 
the computation of partial derivatives in each of the three directions. We store the field values on 
the grid as a one-dimensional array of size = n\U2n^n v ^ where n\ = N r , n2 = 2Ng, n% = N<f>, 
and n v is the number of variables u^, fx = 0, ...,n v — 1. The relation between the Id indices in 
Q-Q and the linear 4d index is p = k + m(i + ri2(j + n^n)). 

We denote the differentiation matrices in the three spatial directions by D\ = CD niXni , D2 = 
FD n2Xn2 , D3 = FD naXn . A . The basic task for differentiation given 3d data (or 4d data for several 
variables) as a Id array is to perform matrix multiplications with a stride of 1 for the first direction, 
a stride of m for the second direction, and a stride of n\U2 for the third direction. This is 
straightforward to implement, but for efficiency we want to resort to optimized library routines. 
Unfortunately, BLAS for example does not provide strided matrix-matrix multiplication. There is 
a strided matrix-vector multiplication, but calling this repeatedly is not efficient. Since our focus 
is on emerging computing platforms like GPUs, choices for matrix libraries are rather limited, and 
hence we look for alternative implementations. 

One elegant way to proceed [T7] is to construct 3d differentiation matrices acting on one- 
dimensional arrays of size iV^ = n\U2n^ using the Kronecker product, 

Dl d = D x ®I 2 ®h, Df = h®D 2 ®h, Dl d = I x ®h®Dz, (10) 
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where the are the identity matrices of size x n^, and the Df d are of size A^ x A^. The 



computation of the spectral derivative of a 3d field given as a Id vector u using (10) is given by 
the matrix multiplication 

d k u = Dfu. (11) 
The examples in |17j implement the Du as sparse matrices in MATLAB. This leads to a very 



straightforward and quite efficient implementation of (11). 

The pseudospectral differentiation matrices D^. d can be called "semi-sparse" . For the remainder 
of this paragraph, let us set N^d = A^ 3 . A dense matrix would have N^ d = N e entries. For finite 
differencing with a stencil of constant size s (independent of N) there are s non-zero matrix elements 
per row for a total of sN^d = sN 3 elements for 3d differentiation matrices. For pseudospectral 
differentiation matrices there are about N non-zero entries per row, and A" 4 of A^ 6 elements of the 
3d differentiation matrices are non-zero. Sparse matrix libraries probably offer varying degrees of 



efficiency for the semi-sparse matrices given in (10). However, if the special sparse structure of 
the D^ d is not taken into account, then sparse matrix operations are expected to be slower than 
strided matrix multiplication due to the overhead in the index manipulations of the sparse matrix 
format (in particular, the additional memory transfer for the index data). 

The implementation that we choose uses two elementary building blocks, BLAS matrix-matrix 
multiplication and a general purpose matrix transpose. For the leading dimension of direction one, 
the indexing is such that the vector u containing the data for the 3d grid for each of the variables 
represents a vector with n\n2nzn v elements, but u can also be viewed as a n\ x n2n^n v matrix. In 
fact, 

("u)nin 2 n 3 n„ = (^)ni xn 2 n 3 n„ = { u )nin 2 xn s n v = ('u)rtin 2 n3Xn^ = { u )ni xn 2 xn 3 xn„ (1^) 

as far as the memory layout is concerned, since the different matrix sizes only refer to different 
ways to index the identical data. In our implementation (C and CUDA), this "reshape" operation 
does not require any memory copies. There could be situations where a copy operation for special 
memory alignment of the rows is required, which however would be a local copy as opposed to the 
non-local copies of e.g. a transpose operation. 

The spectral derivative in the first direction can therefore be written as the matrix multiplication 

(13) 



where with (12) the input and the result are Id arrays of size n\n2nzn v . 

For the derivatives in direction two and three, the data is not stored consecutively and we 
cannot multiply directly by D2 or D3. We implement these derivatives by performing explicit 
matrix transpositions. If direction three was the last dimension, then we could consider using 
some of the built-in transpose operations in BLAS and multiply by D3 from the right. However, 
we choose to combine all variables into one large array u nin2n3nv in order to coalesce the various 
matrix operations. BLAS offers matrix multiplications with various transposes, AB, AB T , A T B, 
and A T B T , but these are not the transposes we need. 

For the derivative in the second direction, we transpose u so that direction two becomes the 
leading dimension, multiply by D2 from the left, and then undo the tranpose, 

{v)n2n:jn v xni = ( K J!iXi!2n3«») ' (14) 

U2XU2 \ u jn2Xn^n v n\ j 

(15) 

(d2u) ni xn 2 n 3 n v = {{fav )n 2 n z n v xni ) T , (16) 



where (12) is assumed, and u and &2U are Id arrays of size n±ri2n3n v . 



Similarly, for the derivative in the third direction, 

(^)fi3Ji„ xn\U2 = ( u nin2Xn,3n v ) ; (l^) 

«3Xn3 \ w )ri2,xri v n\n2 > 

(18) 

y. (19) 



For the partial differential equations that we consider, we always need all three partial deriva- 
tives. Therefore, the computation of the derivatives as written above consists of 4 transpose 
operations and 3 matrix multiplications. In practice, we use CUBLAS and the transpose from the 
CUDA SDK. It is likely that the transpose can be optimized, but as we will see the overall per- 
formance is still dominated by the matrix multiplication. It is interesting to note that even in the 
case of a MATLAB implementation along the lines of [T7], using transposes and the effectively Id 
dense matrix multiplication for the derivatives is faster than the sparse, 3d matrix implementation 
by roughly a factor of 2. 

In terms of the operation count, the computational kernel of the pseudospectral CFF method 
as formulated above is dominated by the matrix multiplications (13), (15), and (18). They are 
given by the product of a small matrix Dk with a non-square matrix u, v, or w representing 
the data. A typical grid size for our example is n\ = n<i = 71.3 = 40 and n v = 50, so for the 
first direction the derivative is computed as the product of a 40 x 40 matrix and a 40 x 80000 
matrix. Although this is a matrix size that is suited for parallelization, the CUBLAS 3.2 library, 
for example, reaches its performance optimum for dimensions that are a multiple of 64, with double 
precision performance dropping from about 300 GFlop/s to 100 GFlop/s if a dimension is not a 
proper multiple. This is not optimal for a spectral problem where a reasonable set of convergence 
runs may consist of steps n\ = 20, 24, 28, . . . , 40. The spectral method discussed here would benefit 
most from the optimization of the matrix-matrix multiplication of a small square matrix times a 
highly non-square matrix. 



2.5. Numerical simulations 

The solution of the time-dependent problem proceeds as follows. First, the grid structure 
is initialized and all required matrices are computed and stored. The grid does not change during 
the evolution. Initial data for the physical fields u^(0) is computed. 

Second, time stepping is performed by the method of lines. We employ a simple fourth-order 
Runge-Kutta (RK4) method. The allowed size of the time-step depends on the clustering of grid 
points near the poles of the spherical shell. Depending on the relative grid dimensions, either 
the clustering in the r- or in the ^-direction is more severe. A Runge-Kutta step consists of 4 
evaluations of S^iu) — A l ^ u (u)diU u . As part of each substep, boundary conditions are applied. In 
our example, one RK4 time step involves 600 one-directional derivatives of individual fields plus 
about 10000 additional floating point operations in the computation of the right-hand-side, where 
overall the workload in the algebra is smaller than that of the derivatives. After one complete RK4 
step, we apply the filter discussed in Sec. [3] to the fields. 

The method is implemented in C/C++ in a package called BAMPS. It inherits several fea- 
tures from the code BAM, which is a mature infrastructure for black hole simulations using finite 
differences [201 E2J [23] . 

In the case of the GPU implementation, a bottleneck is the comparatively slow memory transfer 
between host and device (about 30 times slower than for device-to-device copies) . The initialization 
step is performed on the host, and all data required for the evolution is copied onto the device. 
The time evolution is carried out completely on the device. In our example this is possible due to 
the low memory requirement of the spectral method. If more memory is required than the device 
can provide, the performance assessment changes. Periodically, information about the evolution 
is copied from the device back to the host for processing. For simulations aimed at computing 
the physics of the system, the transfer bottleneck is not a major performance limitation, since the 
physical time-scale is typically much larger than the time-step size required for numerical stability 
of RK4. 
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3. Spherical harmonic filter for tensors 



3.1. Discrete Fourier transform as matrix multiplication 

As a first step we write the standard Fourier transform (e.g. PQ) and its inverse as matrix 
multiplication transformations. Consider a real, periodic function /(0) on the interval [0,27r], 
which is discretized by (f>j = ^j-j and fj = f(4>j) with j = 0, . . . , J — 1. The backward Fourier 
transform (also called Fourier synthesis or expansion in Fourier modes) is written in terms of real 
Fourier modes, 

M-l 

fj = ap + (a m cos mcfrj + b m sin m(j)j), j = 0,...,J—l. (20) 

m=l 

The forward Fourier transform (Fourier analysis, projection onto Fourier modes) is 

1 J ~ l 2 J_1 2 J_1 

ap = -j/ifj, a m = -j/ J fj cos b m = — fj sin m<f>j, m = 1, . . . , M - 1. (21) 

3=0 i=o i=o 

When counting real degrees of freedom, the number of basis functions is odd since &o is zero. This 
suggests using an odd number J = 2M — 1 of sampling points in the (p coordinate so that the 
transform is invertible. However, for the double covering of the sphere that we want to use, J 
should be even, so that given any <f>j the point <f>j + tt is part of the grid. We therefore set 

J = 2M. (22) 

When constructing filters, invertibility is not the goal anyway. 
In matrix notation, 

2irjm 2Trjm 
f = Aa + Bb, Aj = 1, A jm = cos - , B jm = cos - , (23) 

1 ~ 2 2 

a = Af , b = Bf, Aoj = — , A m j = —Aj m , B m j = —Bj m , (24) 

where m> 1. The matrix dimensions are given by fj, om, 6m-1i A j x m, and -Bj x (m-i)- The sine 
and cosine parts can be combined, 

f = Cc, c = Cf, c =(l\ C=(AB), C=(^\ (25) 

with dimensions indicated by fj, cj_i, and Cjx(j—i)- For the numerical implementation, we 
precompute the transformation matrices for the backward and for the forward transform. 

3.2. Discrete spherical harmonic transform by matrix multiplication 

We introduce the discrete spherical harmonic transform along the lines of |12j . and give an 
implementation in terms of matrix multiplication that relies on the pseudo-inverse of the Legendre 
transformation matrix computed via the singular value decomposition |24j . 

We consider functions on the sphere, f(6,cp), with the inner product (f,g) = J fgdu, duj = 
sin 8d9d(p. The spherical harmonics are denoted by 

y, m (M)=^(cos0)e im * (26) 

where the P™ are normalized associated Legendre polynomials such that (Yi m , Yii m >) = 8u'6 mm >. We 
are looking for the discretized version of the backward and forward spherical harmonic transforms, 

oo I „ 

/(M) = £ £ Qm*Jm(M), c lm = (Y lm ,f)= ldu¥ lm f. (27) 

1=0 m=-l J S2 
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We work on an equidistant two-dimensional grid of angles, for which different choices are 
possible. We choose to stagger the poles, and we choose an even number of points in the <j> direction 
(because of the Fourier double cover used for derivatives, see above). Setting = 2Ng = 2N, 
there are TV" x 2N = 2N 2 grid points, 

9i = ^(i+ l 2 ), i = 0,...,N-l, <Pj = ^j, j = 0,...,2N-l. (28) 

For real basis functions the discrete backward transform (synthesis, expansion in spherical 
harmonics) is written as 

L l 

fij = ^ Pr( cos °i)( a im cos mcpj + bi m sijxmtpj). (29) 

1=0 m=0 

We set the maximal value of I (and hence also of m) to 

L = N - 1. (30) 

Exchanging the order of summation according to Yli=o SL=o = Et=o X)£=m> spherical harmonic 
synthesis can be written as a Legendre transform followed by a standard Fourier transform, 

L L—m L—m 

fij = 'y^ J (Aj m a m (8i) + Bj m b m (9j)), a m {6i) = '^2{P m )ik{a m )k, b m (9i) = (Pm)ik{b m )k 

m=0 k=0 k=0 

(31) 

where Aj m and Bj m are the 2N x N Fourier synthesis matrices defined in (23), with J = 2N and 
M = N. For each m = 0, . . . , L we have defined the N x (N — m) matrix 

(P m ) lk = P™ +k (cos9 t ) (32) 

for the Legendre synthesis, where the entries are the normalized associated Legendre polynomials 
for a given m evaluated for I = m + k = m, . . . , L at the angles 9i, i = 0, . . . , L. 

The discrete forward transform (analysis, projection onto spherical harmonics) begins with a 
discrete forward Fourier transform in 0, ( |24[ ), leading to coefficients depending on 9, 

2N-1 2N-1 
0"m{9i) = ^ Amjfiji b m (9i) = ^ B m jfij, (33) 
3=0 3=0 

where m = 0, . . . , L, and i = 0, . . . , L. 

For each m = 0, . . . , L, the forward Legendre transform (analysis) is, conceptually, the inverse of 
the backward transform. Written in matrix notation, the backward Legendre transform (synthesis) 
from the (a m )/c to the a m {9i) in (31) becomes 

SN = PNx(N-m) a N-m, (34) 

where sn represents the synthesized iV-vector a m (9i), a^-m the N — m-vector of coefficients (a m )k, 
and PNx(N-m) is the transformation matrix. 

Here we encounter the usual mismatch between the number of grid points of the rectangular 9-<p- 
grid, which is 2N 2 , and the number of spectral coefficients (a m )/ and (b m )i, which for 1 = 0,..., N—l 
and m = 0, . . . ,1 with (6o)z = amount only to ./V 2 coefficients. Put differently, in general (34) 
cannot be inverted since for m > the matrix -P/vx(JV-m) is n °t even square. There are N equations 
for N — m unknowns aN-m- 

However, we can compute the analysis a = (P, s) by a sum over grid points, which looses 
information, so that s = Pa is an approximation of s. For the Gaussian collocation points of the 
Legendre functions (which we do not use), all that would be needed are appropriate weights Wi for 
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afc = Yli w iPik s i- For a general set of collocation points, we can define an (in general non-diagonal) 
weight matrix W so that a = P T Ws, see for example [12J, which also discusses clever ways to 
compute and store W and/or P T W. 



In principle, one could generalize [12j or the method based on special collocation points to spin- 
weighted spherical harmonics, at the cost of increased analytic complexity. However, especially in 
the context of a matrix method, there is a straightforward alternative. As a simple, direct way to 
invert s = Pa in the appropriate manner, we follow |24J and note that we can define 

Q-N—m = P (N-m)xN S Ni ( 35 ) 

where P + denotes the Moore-Penrose pseudo-inverse of the matrix P. 

The pseudo-inverse A + of a real matrix A is the unique matrix satisfying AA + A = A, A + AA + = 
A + , (AA + ) T = AA + , and (A + A) T = A + A, cmp. |25j . For example, the first relation means that 
although AA + is in general not the identity, it still maps A to A. AA + is the orthogonal projector 
onto the space spanned by the columns of A. If the inverse exists, then A + = A~ 1 . The fact 
we need here is that even if we cannot solve a linear equation Ax = b because the inverse of A 
does not exist, we can still look for vectors x that minimize \\Ax — b\\. There may be several such 
vectors. The pseudo-inverse defines the unique vector x = A + b that minimizes \\Ax — b\\ and has 
the smallest norm ||a;||. 

The pseudo-inverse can be computed using the singular value decomposition (SVD) of A, A = 
USV T . Here U, S, V are matrices, and in particular S is diagonal (and in general non-square). 
For this decomposition, we have A + = VS + U T , and the pseudo- inverse of S is obtained by taking 
its transpose and replacing non-zero entries Sa by 1/Su. 



In summary, the pseudo-inverse allows us to define the forward Legendre transform (35) as 
the least-squares approximation to the inverse of the backward transform via the pseudo- inverse. 
Written out in components, the forward Legendre transform of the discrete spherical harmonics 
transform is 

L L 

{a m )k = "^^{P^kilmiOi), {bm)k = '^2(Pm)kibm(9i), (36) 
i=0 i=0 



with a m (9i) and b m (9i) obtained from the forward Fourier transform, (33). 

We can choose to precompute and store the matrices PNx(N-m) an d Pw- m )xN ^ or eacn m - 
Since this is done once at startup time, parallelization of the SVD routine is not an issue. We use 
the GSL for Legendre polynomials and the SVD [26J. 

3.3. Discrete spin-weighted spherical harmonic transform by matrix multiplication 

Spin-weighted spherical harmonics are a generalization of spherical harmonics. The spin weight 
refers to how a given function on the sphere transforms under the rotation of basis vectors. Spin- 
weighted spherical harmonics were first discussed in terms of spin raising and lowering operators 
in \27\ I28j. which also leads to a definition in terms of Wigner d- functions. Any tensor of degree k 
on the sphere can be naturally decomposed as a linear combination of tensor spherical harmonics, 
which are products of the basis vectors with the spin- weighted spherical harmonics |28} [29], see 
Sec. GLl 

Here we use the definition given in |30| . see also |31} 132] . A spin-n function on the sphere, 
f(6,(j)), transforms under a basis rotation by an angle tp according to / = e~ in ^f. The sign 
convention for n is opposite to the spin weight s = —n defined in |27t |2"8"1 129|. which however does 
not matter for filters constructed as a forward-backward transform. The spin-weighted spherical 
harmonics, Y l r f n (9, <fi), are spin-n functions on the sphere for a given n. They form an orthonormal 
basis in the space of spin-n functions with orthonormality and completeness relations 

djJY^{u)Y^ m ,{u) = 8 iv 8 mm ,, (37) 

s 2 

EE^>')^H = (38) 

/ \m\<l 
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where oj = (9, (p) and 5(u' , uj) = <5(cos 9' — cos 0)5(4>' — (f). Hence, any spin-n function on the sphere 
is uniquely given by 



/(") = EE cf m = (rr m , f) = [ dnY^m. (39) 

l \m\<l S2 

In the above I is assumed to be equal to or larger than |n| , which is implemented with the convention 
that 

y^(w) = and cf m = if I < \n\ or I < \m\. (40) 
The definition of the spin-weighted spherical harmonics (see below) gives 

w = (-irT»)M' (4i) 

Spin-0 corresponds to the standard, non-weighted spherical harmonics, Y® m ((jj) = Yi m (to), for which 
we have the standard orthonormality and completeness relations as a special case of the relations 
above. 

For the numerical computation of the spin-weighted spherical harmonics we use recursion for- 
mulas, as opposed to the non-recursive definition of the Wigner d- functions or the spin operators 
that are also given in [30 . There are different ways to express the Y^ in terms of the Y\ m , depend- 
ing on which recursion relation is used for the ^-derivative of the associated Legendre polynomials, 
compare |30| 13 lj . While |31j is simpler in the ^-dependence of the coefficients, [30| is simpler in 
the range of I, in particular for band- limited functions on a given grid. (We note in passing that 
the coefficients for negative spin weight in [31] have to be corrected since the normalization of the 



spin-weighted spherical harmonics is non-standard and (41 ) does not hold.) A few simple examples 
can be found in App. B. 

The basic recursion formula employed in |30j is 



nk — cos i 

YL = am 1 , Y^ 1 + p nlm — Y£? m (42) 
smv sm6> 1 i,m 



for decreasing n, and for increasing n it is 



V n - n, x T + COS ^V n+l R/ 1 yn+1 (ao\ 

Y lm -«(-«)/ a Y lm ~ P(-n)lm^ o Y l-l, m> 



sin (7 v ' smi 



with coefficients 



//-n+l\5 1 / 2 l + l(l + n-l)(l 2 -m 2 )Y 

a - l = {-jT^r) ' ^ = tU^t T+n -) ■ (44) 

As before, Y^ = for I < max(|m|, |n|). 

This is a two-term recursion in /. Since the coefficients are functions of 9, the integration 
for analysis changes. Starting with n > 0, there are n + 1 terms involving multiplied by 
cot p 9 1 sin 9 9 with p + q = n. However, the overall behavior at the poles is regular. Furthermore, 
since we stagger the grid no extra measures at the poles should be necessary. The result of the 
recursion can be written 

n 

YLM) =J2^lm( d )Y(l- P )m(9^)- (45) 
The usual way to proceed is to compute the expansion coefficients with respect to the Y^ using 



some existing implementation of the spherical harmonic transform. The coefficients in (45) depend 
on 9, which means when considered as functions of 9 the terms of the expansion are not spherical 
harmonics. However, when computing the transform we can move the additional 9 dependence into 
the function that is to be transformed, e.g. (gjf^, /) = {Yi m , g^). As a result, the Y^-transform 
is computed as the linear combination of |n| + 1 YJ m -transforms of the rescaled function /. 
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In our application we implement the spin-n spherical harmonic transform as a matrix mul- 
tiplication (in particular since I is appropriately small). Rather than computing |n| + 1 spin-0 



transforms based on ( 45 ) , we use the recursion ( 42 )-(|44l) directly to compute a single transforma- 



tion matrix for the Legendre-part of the transform. For \n\ < 3, this avoids a factor of up to 4 in 
the number of transforms. 



Analytically, when computing (42)-(44) or (45) it does not matter which type of recursion 



is used (|30j or |31j). However, when computing associated Legendre polynomials from standard 
Legendre polynomials numerically, certain recursions in I are stable, while some recursions in m 
are not as stable. To our knowledge a corresponding large n study has not been carried out for 
spin-n spherical harmonics and different recursions. But note that in our case n corresponds to 
the tensor-degree of the physical fields and is therefore a small, fixed number (that in particular 
does not increase like m and I when increasing the accuracy of the spectral approximation). Still, 
the numerically implementations may differ in accuracy. 

More importantly, we have to ask whether the pseudo-inverse method is applicable to the 
computation of the analysis matrices. That the pseudo-inverse exists is more or less clear, since 
for each n we have the same orthogonality and completeness relations that hold for the n = case. 
Numerically, it is not clear a priori how well the pseudo-inverse/SVD algorithm for the analysis 
matrices handles the differences in the ^-dependence. 

We summarize the actual computation. For spin-weighted spherical harmonics we define 

Y l l l {eA) = P m i(Oy m t (46) 



where the P^ are directly related to the Wigner ci-functions, P^ = (—^) n y^j^d l m ^_ n y These 



"spin-n associated Legendre polynomials" are computed by the recursion formulas (42)-(44). In 
principle we are looking for a numerical implementation of the Wigner d-functions, but this is not 
readily available on most platforms. Given a code-library function for the computation of the nor- 
malized associated Legendre polynomials PJ n {6i), the recursion formulas are directly implemented 
by recursive function calls that increase or decrease n until n = 0. In our case, n = —3, . . . , +3, 
with n an integer. For n < 0, we can also use 

PL = (-l) n+m P^ m - (47) 

The result is the ./V x (N — m) synthesis matrix 

( P m)ik = P(k+m) m (9i) ( 48 ) 



for each m and n, in analogy to the spin-0 case, (32). For the spectral analysis we use the pseudo- 
inverse 

= (K] + k (49) 

of (Pm)ik> where as for the spherical harmonics k = 0, . . . , L — m and the (Q m )ki are (N — m) x N 
matrices. 

The spin- weighted spherical harmonic transform defines a projection filter F n (f) for functions 
/ of definite spin- weight n. Given /, we compute the discrete forward transform followed by the 



discrete backward transform for some finite I < L, cmp. (39). Note that F n is a linear operation 



and using (41) we have 



F n (f) = F-n(f). (50) 



For n/Owe have for non-trivial / that F n (f) ^ F_ n (/), so even if / = / we have F n (f) ^ F n (f). 
Hence, even if / is real, in general the projection F n (f) is complex. 

The discrete spin-n spherical harmonic transform and the corresponing filter is computed in 
complete analogy to the standard (n = 0) case. The matrices P^ and are precomputed. 



For our main application we only store the filter matrix i^(n/) as defined in (62) of Sec. 



3.5 



on filters. We need |n| < 3 and < m < N. During the evolution of the physical fields, the 
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filter is computed by computing the discrete Fourier analysis ( 33 ) , followed by the discrete spin-n 
associated Legendre projection (62), followed by the discrete Fourier synthesis (31). The Fourier 
transforms are independent of n. 



3.4- Spin-weight decomposition of tensors with respect to a tetrad or triad 

In preparation for the construction of spin- weighted filters, we decompose tensors according 
to their spin weight. Consider Minkowski space with coordinates (t, x, y, z) and metric rj a b = 
diag(— 1, 1, 1, 1). We also consider basis vectors aligned with spherical coordinates (i,r, #,(/>), but 
with components in the Cartesian basis (i, x, y, z). We define the right-handed, orthonormal tetrad 
(t a ,r a ,9 a ,4> a ) by 

t a = (— 1, 0, 0, 0), 9 a = (0, cos cos 0, cos 9 sin 0, — sin 0) , , . 

r a = (0, sm.6 cos 4>, sin 9 sine/), cos 9), <j) a = (0, — sin 4>, cos (f>, 0). 

The basis vectors tangential to the coordinate spheres are replaced by the two complex vectors 

m = -j=(0 a + i<fra), m a = -^=(9 a - i(f> a ), (52) 

where m a is the complex conjugate of m a . The orthonormality relation of the complex tetrad 
e° = (t a ,r a ,m a ,fn a ) with respect to the Minkowski metric is 

t a t a = —1, r a r a = 1, m a fn a = 1, m a m a = 1, others zero. (53) 

In terms of e^, orthonormality and completeness read ^ a &e°e^ = rj^ v and rf ,v e a l:L e\ l = rf h '. Intro- 
ducing the conjugate dual of the complex tetrad, fa = rj^Vab^t^ this becomes /fej = 5y and 
Ja — u b . 

Any tensor on Minkowski space can be written in terms of the complex tetrad. For a vector 
v a , the expansion is v a = v^e® with coefficients = {e^^v) = rf l 'ri ao ef / v b = faV a . For = 
(t a , r a , 9 a , 4> a ) , this can be written as 

v* = -t a v a , v r = r a v a , v m = m a v a , v m = m a v a , (54) 

v a = y t t a + v r r a + v m m a + v rfi Jn a_ ^ 

A tensor of degree k is expanded as 

jVLti.../Ltfc _ jmi j?fi k rpai...a k rpa\...a k _ rpfii...fi k R ai R au (56) 

This construction simplifies trivially to the case of three-dimensional Euclidean space by dropping 
t a and replacing the indices by i = 1, 2, 3 and /x = 1, 2, 3. 

The key property of the complex tetrad that concerns us here is its transformation under 
rotations about a given radial direction r a . The vectors r a and t a do not change. The vector m a 
is chosen for its simple transformation under such rotations, 

m'a = e^m a , (57) 

where ip is the angle of the rotation. The spin weight of a function / constructed from a tensor by 
contractions with the tetrad refers to its behavior under rotations of the tetrad vectors. If such a 
function transforms under tetrad rotations as 

/' = e~^ n f, (58) 



we call it a function with spin-weight n. Referring to (55), we have n(v r ) = 0, n(v m ) = +1, 
and n(v m ) = — 1. For the tetrad vectors themselves, we define n{t a ) = n(r a ) = 0, n{m a ) = — 1, 
and n(m a ) = +1. According to (58), for products of spin-weighted functions we have n(fif2) = 
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n(/i) + n(f2). For products of tetrad vectors, n(e^ . . . ) = Y^j=t n ( e< ^])- For example, r a ri,, 
r a mb, m a mb, and m a mfe have spin- weights 0, —1, —2, and 0, respectively. Products of tetrad 
vectors have a well defined spin- weight, but the sum of spin- weighted tensors is in general a tensor 
without well-defined spin weight. 

Note the distinction between coordinate rotations and tetrad rotations. By definition, any ten- 
sor is covariant under coordinate transformations, but here we have introduced additional structure, 
the tetrad, and discuss how functions that are constructed from the tetrad and tensors transform 
when the tetrad is transformed. The physics of the problem we consider is rotation invariant, i.e. 
it does not refer to a preferred choice of z-axis or tetrad vector m a . Concretely, if m a is not part of 
the construction of a physical field v a , then n{v a ) = 0. If we choose to expand v a in terms of the 
tetrad, then its components acquire specific spin weights, but each term of the sum in v a = v^e^ 
has spin-weight 0. 

3.5. Filters defined by spherical harmonic projection 

In this work, the main application of the discrete (scalar and spin- weighted) spherical harmonics 
transform is its use as a filter. It is unclear a priori what type of filtering is needed or optimal for 
the Einstein equations implemented with the particular CFF method that we consider, and any 
filtering scheme has to be carefully evaluated. 

First of all, in order to suppress high-frequency modes near the poles, we expand fij by the 
forward transform in spherical harmonics up to degree L, i.e. we project onto the spherical har- 
monics basis. The backward transform results in an approximation fij of the original fij with 
equiangular resolution over the sphere, which in particular means that the high frequencies that 
can be represented on the 0-(j) grid but are unwanted near the poles have been eliminated. This 
removes certain restrictions on the time step size due to clustering of points near the poles in 
the (^-direction. Transforming to and from spherical harmonics for a finite L defines a projection 
filter. In the context of the double Fourier spectral method on the sphere (for scalar fields), the 
projection filter ensures equivalence to the more standard spherical harmonics method to compute 
derivatives. 

We assume that L is the maximal degree of spherical harmonics represented on the grid, and 
we define additional filtering by explicitly removing the top nj of the highest degree /-modes, i.e. 
I < L — nj . In our case there are two unrelated reasons to do so. For non-linear problems, there is a 
large variety of approaches pQ to deal with the non-linear mode mixing. For example, for quadratic 
non-linearities the two-thirds rule can be helpful for one-dimensional intervals, while on the sphere 
this may become a one- half rule since the basis is not 'reflective'. It is unclear a priori what type 
of filtering is needed or optimal for the Einstein equations, which are worse than quadratically 
non-linear. As in |15} H], but in contrast to [E] which uses a different formulation of the Einstein 
equations, we do not resort to filtering one-half or one-third of the modes. This does not appear to 
be necessary, neither in the radial nor in the angular direction, and we have not investigated this 
here. However, the residual linear growth discussed in Sec. |4.1| might be addressed with additional 
filtering (or alternatively by improved boundary conditions). 

A second issue is the tensor character of the fields in combination with the Cartesian coordi- 
nates. This leads to the observation that a small nf > is required for stability (here nj = 4), 
which depends on the rank of the tensors but not on the grid size (e.g. nf = N/3 for filtering the top 
third). To examine the Cartesian tensor issue, we consider three types of filters based on scalar and 
spin- weighted spherical harmonics, which we call the scalar Y-filter, the tensor Yn- filter, and the 
graded Yg-filter. For the Y-filter, we apply the standard, non-weighted spherical harmonics filter 
to each field u^, ignoring the tensor character of the fields. The Y-filter addresses some of the clus- 
tering issues of the double Fourier method, but there remains a strong instability, which however 
appears to be cured when the Yn-filter using projection onto spin-weighted spherical harmonics is 



used, see Sec. 4.1 



One view of the problem is that Cartesian components introduce additional angular dependence 
compared to spherical coordinates, which effectively increases the order of a spherical harmonic 
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expansion by one for each spatial tensor index. Consider the spherically symmetric scalar function 
f(x, y, z) = r, which requires only I = in a spherical harmonic expansion. Its first derivative 
d x f = x / r = sin#cos(/> is the component of a Cartesian vector, which corresponds to I = 1. Its 
second derivative d x d x f = -(1 — \), which is the component of a 2-tensor, requires / = and I = 2. 



Analogously, referring to (51)— (56), each contraction with mj to compute components of a tensor 
in the spherical basis multiplies the Cartesian component by a first order polynomial in sin# etc., 
which increases the / required in a spherical harmonic basis by one. Since on the numerical grid 
we can only represent a finite, maximal degree L, for each spatial index of a tensor in Cartesian 
components the available degree L is effectively lowered by one compared to spherical coordinates. 

Let us denote by d{u^) the spatial degree of the Cartesian tensor component, i.e. the number 
of spatial indices of the variable u^. For example, d(gtt) = 0, d(gt x ) = 1, d(g xx ) = 2. Then the 
effective maximal degree L e g represented on the grid is L — d{u^). In the evolution equations and 
the constraints, the different spatial degrees are coupled, e.g. dijk — digjk- This suggests that the 
tensor Yn- filter should be used with rif > 3 so that each spin- weight mode is representable at the 
same maximal order L e g on the grid. 

On the other hand, from this point of view the scalar Y-filter is problematic, since for a given 
nf it does not project onto a basis at the same L eXX . For example, suppose we want to project 
some given 3d data onto spherically symmetric data. If we choose the Y-filter with rif = L, then 
a scalar function is correctly projected onto its spherically symmetric monopole, but for a vector 
we have to use nj = L — 1, and e.g. for d^k it should be nj = L — 3. This leads us to consider an 
improved version of the scalar filter, which was also (and possibly for the first time) considered in 
[H]. We define 

n f (fj,) = n f -d(u^), (59) 

and introduce what we call a "graded" Y-filter, or Yg- filter, where the top rif(ri) components in the 
Y-basis are zeroed. The Yg-filter improves on the Y-filter since the highest order /-modes are now 
treated consistently across the different spatial ranks of the tensor components. However, since the 
Cartesian components are actually a mixture of different spin- weights, compared to the Yn- filter 
the Yg-filter does not treat the intermediate spin- weights correctly. As we show in Sec. |4.1[ the 
Yg-filter cures one type of instability present in Y-filter simulations. Yet an additional, more slowly 
growing instability is left over, which however the Yn-filter is able to handle. 

The actual implementation of the filters is as follows. Given a general tensor, we cannot apply 



the Yn-filter directly. First, the tensor is decomposed according to (56). Each component function 



in the expansion is filtered according to its spin weight. The result is recombined again as in (56) 



Note that ( 50 ) is compatible with the m % and m ! vectors of the tetrad. For the tetrad components 



of a real vector v' 1 , F n {v m ) = F- n (v m ), which is just as it should be since the spin-weights of v m and 
v m have opposite sign. Denoting the general filter operation by F and the specific spin-n version 



by F n , we have for example F(g tt ) = F (g u ) and F(g tm ) = F 1 (g tm ). With (50) and linearity of F 



we can reexpress the filter operation in terms of non-complex basis vectors. For example, 

F(g t9 ) = ^={F 1 {g tm ) + F_ x (g m )) = V2Re(F x (g tm )) = Re(F x (g te ) + iF x {g t4> )). (60) 

The projection filter is implemented as a forward Fourier transform in the ^-direction, followed 
by the projection filter 

s N = {PP + ) NxN s N (61) 

onto the Legendre basis for each m in the ^-direction, followed by a backward Fourier transform 
in the ^-direction. Here P and P + = Q refer to the matrices appropriate for either the standard 
or spin- weighted Legendre transforms. This can be generalized to additional filtering by 

SN = F NxN Sn, FnxN = PNx(N-m)fdiag(N-m)P(N-m)xNi ( 62 ) 

where the elements of the diagonal matrix / are one for modes that are to be maintained and 
zero for the nj or nf(fi) modes that are to be removed. The standard choice considered in the 
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literature is to remove the top 4 modes [TSJ 0], in which case / = diag(l, . . . , 1, 0, 0, 0, 0). Recall 
that PNx(N-m) stands for (P m )ik with k = 0, . . . , L — m, with the entries obtained for I = m + k = 
m, . . . ,L. Therefore, zeroing the top 4 components of the (N — m)-vector P(j^_ m ^ x j^ s N removes 
the components I = L — 3, L — 2, L — 1, L. 

If storage is not an issue, we can precompute and store the different -Fjvxjv for each m, which 
requires 0(N 3 ) storage. For the Einstein equations in GHG form on a spherical shell, storage is 
not much of an issue since the filter is applied to each of about 50 variables for every value of 
the radius, and Fjy X ]y is independent of r. In our example, storing FjvxAT is roughly equivalent 
to requiring storage for 51 instead of 50 variables, with somewhat more storage required if the 
matrices for each spin-weight are stored. 



4. Numerical results 

In this section we first present numerical experiments for the single black hole test case in 



Sec. 4.1, and then evaluate the computational efficiency of the pseudospectral matrix method in 



Sec. 4.2 



4-1. Test case of a single, evolving black hole 

As a non-trivial application of the CFF/Yn-filter method, we consider the basic example of 
a static, spherically symmetric single black hole. Analytically, the time derivatives dfU^{t, x,y, z) 
all vanish. The discretization error of the numerical method leads to a non-trivial time evolution, 
which in particular can depart from spherical symmetry. The numerical method is successful if the 
system settles down in a stable stationary state of the discretized equations that approximates the 
analytical solution, where all the dtu^ have dropped to the level of the round-off error. 

We discuss a set of time evolutions on a single spherical shell. The initial data is the same in 



each case, see Appendix A. 3 but approximated on different grids of size N r x Ng x N^. For all runs 
discussed here, the radial coordinate extends from r = 1.8 to 11.8, which we label configuration 
R10. We begin our discussion with examples that are numerically stable for long times, i.e. the 
full CFF/Yn-filter method, and then discuss the effect of using different filters and different time 
step sizes. Part of the default configuration is the Yn- filter with nj = 4 and a time step with 
A = Ai/Ax m i n = 4.0 (see below). 

In Fig. [TJ we consider the evolution on a grid with dimension 25 x 9 x 18. Shown are the metric 
components gu, gtx, an d g X x and some of their time derivatives on the x-axis at different times t. In 
the top left, we show the initial data at t = and the data at t = 1000. On this scale, no evolution 
is discernible, that is the lines for t = 1000 fall on top of the lines for t = 0. In the top right, 
we show the numerical right-hand-side (rhs) for the variables, i.e. the numerical approximation to 
dtgtt, dtgtxi and dtg xx - At t = 0, they are non-zero at around 10~ 8 . This indicates that already for 
the small grid with 25 x 9 x 18 points the spectral method gives a rather accurate approximation to 
the analytic solution, for which the time derivatives vanish. In the bottom left and right of Fig. [TJ 
the time evolution of rhs (git) is shown for vertical scales of 10~ 8 and 10 -11 . The color/gray-scale 
coding shows progressing time from dark to lighter colors. There is an oscillation that is largest 
for small r. For the black hole, many quantities follow an approximate l/r p dependence, with 
gravity being strongest near the inner boundary and falling off for large radii. The amplitude of 
the oscillation decreases with time. 

In Fig. [2j we show how the oscillations are damped with time and how the solution converges 
with resolution. As a representative example for the time dependence of the system, we consider 
a norm of the right-hand-side of gu- We plot the logarithm of the infinity- norm as a function of 
time, i.e. log 10 (max |rhs(<?tt)|)(i), where the maximum is computed on the innermost sphere of the 
grid where the fields are strongest. We vary the radial resolution from N r = 13 to 37 while keeping 
the angular resolution fixed at Ng = 9 and = 18. 

Fig. [2] shows that, as expected, the analytic initial data leads to a finite error that depends 
on the spatial resolution of the grid. Key feature of these runs is stability and convergence, and 
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Figure 1: Single BH, grid RIO. Some variables on the x-axis for different times. Top left: Variables g u , gtx, and g xx 
at time t = and t — 1000. Top right: Time derivative of g t t, gtx, and g xx at t — 0. Bottom left and right: Time 
derivative of gtt during the evolution at two different scales. There is an oscillation in gtt that is largest for small 
r. The color coding indicates early times in dark, later times in brighter colors. The amplitude of the oscillation 
quickly decreases with time. 



that during the time evolution the system settles down in an approximately stationary state of the 
discretized equations. While settling down, the system oscillates with a frequency and amplitude 
that is independent of the resolution. The time dependence dies out exponentially. Exponential 
convergence with radial resolution is evident. Round-off error is reached around 10 -12 to 10 -13 . 
In this simple case, N r ~ 40 suffices to approximate the initial data and the evolution at round-off 
accuracy. 

In Fig. [3} on the left we examine the dependence of rhs((?#) on angular resolution. Although the 
initial data is spherically symmetric, since the numerical method is fully 3d deviations from spheric- 
ity occur. The initial, damped oscillations do not depend on the angular resolution. However, the 
level of the round-off error increases when the number of grid points is increased. 

In Fig. [3| right panel, we vary the time step size looking for the largest allowed time step 
giving a stable evolution. For stability of the time integration, the rule of thumb is that the 
eigenvalues of the pseudospectral spatial operator have to lie in the stability region of the method 
of line integrator, although in general this is not a sufficient condition and the pseudospectra 
have to be considered [T7]. The argument about domains of dependence leading to a Courant- 
Friedrich-Lewy condition vAt/Ax < const, where v is the propagation speed, does not apply 
directly to pseudospectral methods since the spatial stencil covers the entire domain. Here we only 
investigate stability by numerical experiment. Tab. [T] shows the result of a numerical, iterative 
search for the largest allowed time step At. We find that this is directly related to the smallest 
spatial distance on the grid. For the 3d spherical grid defined in Q-Q with = 2Ng, it 
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Figure 2: Single BH, grid RIO. Shown is the logarithm of the infinity-norm of the right-hand-side of the evolution 
equation for the variable gtt versus time. The number of grid points in the radial direction is varied from N r = 13 to 
37 while keeping the angular resolution fixed at Ne = 9 and N<j> = 18. The analytic initial data leads to a finite error 
that depends on the spatial resolution of the grid. Key feature of these runs is the exponential convergence with 
radial resolution, and that the system settles down in an approximately stationary state of the discretized equations. 




Figure 3: Single BH, rhs of gtt- Left: Different angular resolutions at fixed radial resolution. The oscillations do not 
depend on angular resolution, but the round-off floor rises with resolution. Right: For the given grid, varying the 
time step size, At = A min(Aa;), leads to stable runs for A < 4 and to unstable runs for A > 5. The stable runs settle 
down within t = 1000 of evolution time. The oscillations in the rhs (both the period and amplitude) are independent 
of A down to 10~ 12 , i.e. the discretization error associated with RK4 is less than 10~ 12 . The larger the number of 
time steps, the larger the error for the stationary regime beyond 1000M, ranging from 10" 12 for A = 0.25 to 10" 13 
for A = 4.0. The run for A = 5.0 is borderline unstable, with a comparatively slow exponential growth. The run for 
A = 6.0 fails within 50M. 



depends on the number of grid points in the different directions whether the clustering of points in 
the radial or in the (^-direction is more severe. We either have min(Ax) = min(Ar) = n — ro, or 
min(A3;) = min(2r sin(0) sin(A^/2)) « rosin(#o)(0i — 4>o)- It turns out that the "Courant factor" 
defined by 

A = At/ min(Ax), (63) 

determines stability, i.e. we should choose a time step At = Amin(Ax) with A < A max . Based 
on Tab. [TJ A max ~ 6 for grids with N r > 19, even though the smallest Ax may occur in different 
directions. As a default, we choose A = 4 in standard runs. 

In Fig. [4j we show how the evolution depends on the degree of tensor spherical harmonic 
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N r x N e x N<f, 



13 x 09 x 18 
19 x 09 x 18 
19 x 15 x 30 
25 x 09 x 18 
25 x 15 x 30 
31 x 09 x 18 
31 x 15 x 30 



At 



stab 



At uns t 
— At s tab 



0.4881 
0.4489 
0.2325 
0.2651 
0.2397 
0.1734 
0.1734 



0.0036 
0.0060 
0.0030 
0.0015 
0.0031 
0.0009 
0.0009 



At stab 

min(Ar) 



min(pA<^>) 



2.865 
5.910 
3.060 
6.198 
5.603 
6.332 
6.332 



4.497 
4.135 
5.910 
2.442 
6.093 
1.598 
4.409 



[ At stab ] 

I- min(Ax) ' 



4.4 
5.9 
5.9 
6.1 
6.0 
6.3 
6.3 



Table 1: Single BH. Empirical time-step size for stable evolutions with RK4. The data is based on a bisection search 
bracketed by values of the time step At for stable and unstable runs. Runs are called stable if they do not fail within 
t — 10000. The largest stable time step size found is denoted At sta t, while At un3t is the smallest time step found 
for an unstable run. The result can be related to the smallest grid spacing in space, which depending on N r and 
N<t, — 2N$ may be obtained for the points in the radial direction, with clustering due to the Chebyshev grid, or for 
the (^-direction, with points clustering near the poles. In this example, if min(Ar) is less than min(r sm9A(/>), then 
At a tab is independent of N$, and the time step can be chosen up to roughly 6 times larger than the smallest grid 
spacing, A max < 6. 
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Figure 4: Single BH, rhs of gtt- Dependence on n/, the number of spin- weighted spherical harmonics removed from 
the top for filtering. For nj — 0, 1, 2, the runs fail very quickly within t = 110. For nf > 3, the runs appear stable, 
although for long runs, there are some cases where n/ ■ — 3 fails earlier than the others. In most cases we set nj = 4. 



filtering, with n/ indicating the number of modes that are set to zero in the spherical harmonic 
projection. For nj = 0, projection onto tensor spherical harmonics is performed without additional 
filtering, which nevertheless removes certain high-frequency components of the double Fourier basis 
near the poles. For nj = 0, 1, 2, the runs become unstable on a very short time-scale. For nj > 3, 
the runs appear stable, although for long runs, there are some cases where nj = 3 fails earlier than 
the others. Our default choice is therefore nj = 4. This behavior is consistent with the expectation 
that the tensor rank of the fields determines the minimal n/ required for stability. In our example, 
the highest rank for the components in u^ 1 is 3, which implies that spin-weights 0, . . . , ±3 occur in 
the tensor spherical harmonic decomposition. For consistent filtering, it is not sufficient to only 
filter for weights < 3. Furthermore, we also require the derivatives d^u 11 , which raises the rank 
to 4. Apparently, since we filter the fields, even nj = 3 has a chance to work. In [1], the filter 
is applied to the right-hand-sides which are of rank 4, so in that case nf > 4 may be strictly 
necessary. Finally, we note that for the given experiment we do not seem to require a filter based 
on, say, a 2/3 or 1/2 rule, due to the non-linearity of the fields. Independent of the grid size, a 
constant nj = 4 suffices to obtain rather long-term numerical stability. 

Runs without the spin-weighted Yn-filter are not as stable for the two alternatives that we 
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Figure 5: Single BH, rhs of g u - Runs with the scalar Y-filter fail within a time of a few thousand. Top: Runs for 
different Courant factors A = Ai/Ax m i n versus time (left) and versus number of time steps (right). Runs for larger 
A last longer. The exponential growth depends on the number of time steps, but it is not a simple proportionality. 
Bottom: The instability is directly related to the 020 mode, which starts growing exponentially at about t = from 
around fO -14 . 



tried. First, we consider the basic scalar Y-filter. This filter ignores the tensor character of the 
components of u^, but each component is smooth and the approximation is spectrally convergent. 
In Fig. [5j runs with the Y-filter and nj = 4 are seen to fail within a time of about 7000. Although 
the initial damped oscillation is exactly that of the Yn-filter runs, at about t = 1000, there is 
exponential growth at a constant rate that leads to the failure of the run. For rif < 2, the runs 
are much shorter lived, while increasing rif to 4, 5, or 6 does not change the picture, similar to 
the Yn-filter runs. This does not appear to be a time-step instability due to choosing A too large, 
i.e. smaller A fail earlier. The exponential growth depends on A, but it is not simply proportional 
to the number of time steps, see the top right panel of Fig. [5] We also investigate the behavior 
of individual ai m and b[ m modes in the expansion of rhs(gtt) m non-weighted spherical harmonics, 
(29). The bottom panel of Fig. [5] demonstrates that the instability is directly related to the 020 
mode, which starts growing at about t = from around 10 -14 , overtaking the decay of the overall 
function at around t = 1000. Other modes grow as well, but we only show the largest mode. In 
other words, already at early times there is a small error at round-off accuracy that is not visible 
in ihs(gu), which seeds an unstable, unphysical mode that is not kept in check by the Y-filter. 

As an inbetween alternative to the Y- and Yn-filters, we consider the graded Yg-filter with 
nf(n) = rif — d{u^), see the discussion around (59), which takes into account the shift between 



tensors with a different number of spatial indices. Fig. [6] shows results at five different resolutions. 
The runs settle down in the same manner as before within t = 1000. The exponentially growing 
modes shown for the Y-filter in Fig. [5] do not occur. After t = 1000, there is some slow linear 
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Figure 6: Single BH, rhs of g t t- The graded Yg-filter allows simulations that last until about t = 70,000. On the 
left, we see how the run settles down exponentially by t — 1,000, which is followed by a slow linear growth until 
about t = 10, 000 to 20, 000, followed by exponential growth that eventually crashes the run. The rates of decay and 
growth are independent of radial resolution. On the right, we show for a medium resolution how growth in certain 
scalar spherical harmonic modes starts dominating the behavior of the field. 



growth. Computing vhs(gtt) for t < 10, 000 it may even appear that there is no additional instability. 
However, there is another type of exponential growth occuring at a slower rate than for the Y-filter, 
which is also starting at round-off at early times. On the right in Fig. [6j we show how various scalar 
harmonic modes behave during the run at some particular resolution. After the initial phase, the 
modes aoo, aio> &il> an d £>n remain below 10" 13 . The main instability is visible in 020 , 0,21, and 
621- It appears to start at t = at around 10 -15 , and then grows at a constant exponential rate. 
Notice how the unstable modes overtake the regular feature at about t = 15, 000 in the plot on the 
left. 

The Yn-filter cures both exponential modes that occur for the Y-filter and the Yg-filter. We 
compare the Yg-filter and the Yn-filter for t < 200, 000 in Fig. [7j The Yn-filter runs do not exhibit 
any exponential growth, although some linear growth remains, see the bottom panels. The linear 
growth is roughly the same for both filters, and it decreases with radial resolution. As far as the 
tensor character of the fields is concerned, the Yn-filter with iif > 3 should remove all instabilities 
due to an inconsistent treatment of tensors. However, other instabilities may well occur at a later 
time. If so, they are not yet visible in the mode decomposition by t = 200, 000, compare Fig. [HJ 

It may be worth recalling that the target of state-of-the-art binary black hole simulations is the 
last 10 or perhaps 20 orbits before merger, which corresponds to t < 10, 000. If the limitations of 
the present example would carry over, the method would comfortably satisfy the numerical stability 
requirement. Note also that typical code tests only report evolution times of up to t = 400 in [15] , 
t = 10, 000 in [4J, or t = 5, 000 in |14| . However, there is no reason that the simplest black hole test 
should not yield unlimited stability. In fact, the main limitation of our test case is that we ignore 
the available sophisticated outer boundary conditions for the GHG system, e.g. jU [331 EH E5] • As 
it turns out, simply increasing the radial dimension of the shell by moving the outer boundary 
from r max ~ 12 to 22 makes the runs fail in a way that appears to signal a breakdown due to 
the boundary condition. We leave the investigation of proper outer boundary conditions to future 
work since our focus here is on the construction of the CFF/Yn-method. 

4-2. Computational efficiency 

The choice of matrix multiplication methods for both the spectral derivatives and filters can 
be viewed as one of convenience, since it simplifies the implementation of spectral methods on the 
sphere, in particular for the tensor spherical harmonic filter. However, a priori it is not clear what 
the difference in performance is compared to the fast Fourier transform. If there was a significant 
performance penalty due to the 0(N 2 ) operations of matrix-vector multiplications compared to the 
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Figure 7: Single BH, rhs of gtt- Long term stability, comparison between the graded Yg-filter and the tensor Yn- 
filter. Top: The Yg-filter runs fail around t — 70,000, with an unstable mode visible at t — 15,000 around 10~ 12 . 
The Yn-filter runs last beyond t = 200, 000. Bottom: For both filters there is a linearly growing mode. Its slope is 
roughly the same for both filters, and it is smaller for higher radial resolution. 




Figure 8: Single BH, rhs of gtt- Long-term behavior for the tensor Yn-filter. Shown are the same quantities as 
in Fig. [6] The runs last for at least t — 200,000. A residual linear growth is visible, which is less than 10~ 10 per 
At = 100, 000, depending on resolution. 



0(N log N) of FFTs, then we should aim for a fast transform implementation (with the possible 
exception of the Legendre transform). However, as argued above, the 3d physics problem that we 
consider leads to transforms with iV < 50. For such small N, the matrix multiplication method 
can be even faster than the FFT |161 [I] . 
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Figure 9: Example for the performance of matrix multiplication and the fast Fourier transform relevant for pseu- 
dospectral differentiation on a CPU (left) and on two different CPUs (right). Shown is the runtime versus the 
leading dimension n\. We compare the multiplication of a n\ x n\ matrix and a n\ x (20x20x 54) matrix with 
the corresponding two Fourier transforms. As is typical for FFT implementations, performance depends strongly on 
the problem size. In this concrete example, on the CPU the matrix multiplication offers comparable performance 
for n\ < 70, while on the GPU ni < 100 comparing with the n\ for which the peak performance is compared. 
Considering all m, on average matrix multiplication is significantly faster than the FFT for the CPUs even beyond 
ni = 150. 



Fig. [9] shows a representative benchmark for our specific method. Part of the pseudospectral 
algorithm is the Id transform of several variables on a 3d grid, see (13) for the derivative d x and 
(24) for the Fourier analysis in the (^-direction. In Fig. |9j we compare the run time as a function 
of the leading dimension n\ (assumed contiguous in memory) for different implementations of the 
matrix multiplication of a n\ x n\ matrix and a ri\ x (20 x 20 x 54) matrix. For simplicity, we 
consider only this operation and do not include the computation of actual derivatives (the FFT 
method requires a transformation in Fourier space) or the generalization to all three directions. As 
example for CPU performance, we show results for a single core of a i7-870 CPU using FFTW 3.2.2 
for the Fourier transform and ATLAS 3.8.3 (sse3) for the matrix multiplication. As example for 
GPU performance, we consider NVIDIA's GTX580 and M2070 Fermi/Tesla cards running CUDA 
3.2 versions of CUBLAS and CUFFT. Notice in Fig. [9] that the Tesla card can outperform the 
GTX card only in special cases and only for the matrix multiplication. 

As expected, the performance of the matrix multiplication scales approximately like N 2 , while 
FFT performance depends strongly on the size of the transform (i.e. on the prime factors of n\). 
We vary n\ in steps of two, n\ = 4, 6, . . .. For the CPU, the matrix multiplication offers comparable 
performance for n\ < 70. For the GPUs, comparable performance is obtained for n\ < 100, but 
only when comparing with the optimal values of n\. Optimization for arbitrary n\ is currently not 
as even with CUFFT as with FFTW3. Although it might be feasible to eventually restrict physics 
runs to the available fast ni-FFTs, we note that on the GPUs on average matrix multiplication is 
significantly faster (by a factor of more than 10) than the FFT method even beyond n\ = 150. We 
therefore focus exclusively on the the matrix multipliation method in this work. 

In Tab. [2j we quote some results for the performance in Gflop per second of the three matrix- 



matrix multiplications required for the computation of 3d partial derivatives, see Sec. 2.4 Opti 



mization for the new Fermi chips was included in the transition from CUDA 3.1 to 3.2. Certain 
small matrix multiplications [36] started working on Fermi with CUDA 4. Ore. The Tesla cards 
C2050 and M2070 have four times the number of floating point units compared to the GTX se- 
ries. However, for the specific matrix sizes considered, a multiple of 64 is required in the leading 
dimension to benefit from the additional FPUs. For the small grids of the black hole example, 
we obtain around 50 to 100 Gflop/s. For somewhat larger grids the performance approaches 200 
Gflop/s, reaching roughly 300 Gflop/s on the Tesla cards when the leading dimension is 64. This is 
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Table 2: Performance of dgemm on GPUs/CUBLAS for the three matrix multiplications of pseudospectral deriva- 
tives. Numbers are in Gflop per second. For the small grids of the black hole example, we obtain around 50 to 
100 Gnop/s. For somewhat larger grids the performance approaches 200 Gflop/s, reaching 300 Gfiop/s on the Tesla 
cards when the leading dimension is 64. 
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Table 3: Performance of the matrix transpose on GPUs/CUDA for the four transposes used to implement the 3d 
pseudospectral derivatives. Numbers are in Gbyte/s. For comparison, the device-to-device copy operation from the 
CUDA SDK (bandwidthTest) is given. The combined transpose-matmul-transpose operation is dominated by the 
matrix multiplication. 
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Table 4: Performance of the spectral evolution of a single black hole on a GPU (GTX580, CUDA 4. Ore) compared 
to one core of a CPU (i7-870). For a given grid size, 1000 RK4 evolution steps are performed. The startup time is 
not counted. The matrix multiplications of the derivatives and the filter amount to about 75% of the runtime on 
the GPU, while the remainder is mostly due to the (simple but memory intensive) algebra of the right-hand-side of 
the Einstein equations. For the largest grid, the GPU implementation is about 20 times faster than the single core 
CPU implementation. 

close to the maximal performance reported for large square matrices in |37j . on which the current 
CUBLAS/Fermi optimization is based. The theoretical peak for these Tesla GPUs is around 500 
Gflop/s. 

Tab. [2] does not include the four transpose operations of the derivative calculation. In Tab. [3j 
we give memory transfer rates in Gbyte/s for the transposes. The Tesla card had ECC memory 
activated. Its transfer rate is only about half that of the GTX580 card. For the medium grid 
sizes the GTX580 and the M2070 both achieve about 100 Gflop/s, so it appears that the M2070 
compensates for the lower memory speed with its larger number of floating point units. We also 
compare with the bandwidth test for device-to-device copies included in the SDK, which results 
in lower numbers than the peak one-directional memory bandwidth of the cards (159, 192, and 
150 Gflop/s, respectively). The transposes are out-of-order copies that on the Fermi cards reach 
half the speed of the direct device-to-device copies. There probably is room for optimization of the 
transpose, but the derivative calculation mostly depends on the speed of the matrix multiplication. 
In Tab.[3j we also quote Gbyte/s for the matrix multiplication considered as a matrix to matrix copy 
operation, and these numbers are lower by a factor of 4 to more than 10 than for the transposes. 

In Tab. [4j we give benchmark results for the black hole example. For a given grid size, 1000 
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RK4 evolution steps are performed. The startup time is not counted, but all other parts of the 
calculation except input/output are included. The CFF/Y-filter method is a combination of matrix 
multiplications and transposes for the derivatives, and a collection of matrix multiplications of 
varying size for the filter. One other costly part of the computation is the algebra that is required 
in addition to the derivatives on the right-hand-side of the evolution equation. For the Einstein 
equations, this is a sizable problem with about 10000 floating point operations per grid point per 
RK4 update. Furthermore, there are about 200 variables (50 fields and their derivatives) plus a 
correspondingly large number of temporary variables used during the calculation, which exceeds 
by far the number of registers of the Fermi cards (less than 64 are available per thread). On a 
CPU, the algebra amounted to less than 10% of the overall calculation, but on the GPU the matrix 
part is more efficiently parallelized in the current implementation. 

The bottom line is that the GPU implementation is 21.8 times faster than the (single core) 
CPU implementation for the largest grid. For smaller grids the speedup is still on the order of 10 
to 17. A quad-core implementation using a BLAS library led to a speed up of 2 to 3 for the part of 
the matrix multiplication. Extrapolated for the complete algorithm for the largest grid, this would 
still leave a speed-up of 7 to 11 on the GPU. 

5. Conclusion 

We discussed the construction of a pseudospectral method for non-linear, time-dependent tensor 
fields on a spherical shell. The proposed CFF/Yn- method, i.e. a Chebyshev-double-Fourier basis 
combined with a spin-weighted spherical harmonic filter, was successfully implemented for the 
model problem of a single black hole. We demonstrated that a matrix method for both the spectral 
derivatives and the filter resulted both in analytic simplicity and also in ease and efficiency of the 
numerical implementation. To this end, we developed a matrix method for the discrete spin- 
weighted scalar harmonic projection for arbitrary spin weight. The parallelization of the CFF/Yn- 
method was evaluated for a GPU implementation. Numerical results for three different filter 
strategies were given, showing that the simple Y-filter and the improved, graded Yg-filter lead to 
instabilities, that however are cured by the tensor Yn-filter. 

In future work we intend to report on the generalization of the method to domain decomposi- 
tions. For example, the method as described here has already been successfully applied to multiple 
nested shells for scalar waves. Spherical shells are one of the building blocks for more general 
grids that are needed, for example, for two black holes. A key feature to implement for general 
applications in numerical relativity is the incorporation of appropriate outer boundary conditions. 
In our example, a single black hole could be easily simulated in the device memory of one GPU. 
Spectral methods may be able to realize an appealing local optimum in computational efficiency, 
if two black holes can be simulated on a single graphics card due to the memory efficiency of the 
spectral method. 

The GPU implementation is promising, giving speed-ups on the order of 10 to 20 compared to 
a single core of a CPU. A next step in the optimization would be to use parallel kernels (which 
recently has become possible for CUDA) for the different small matrix operations in the spherical 
harmonic filter. Although we focused on CUDA, the important step is the organisation of the 
algorithm in terms of matrix operations, which should be equally helpful for other platforms. In 
general, pseudospectral matrix methods would benefit most from further optimization of a subclass 
of matrix multiplications that is somewhat outside the mainstream, i.e. the product of small, square 
matrices with large, highly non-square matrices. 

Appendix A. Formulation of the Einstein evolution problem 

In this section we collect the equations for the generalized harmonic gauge (GHG) system and 
the single black hole test case. The GHG system that we focus on here is the version introduced 
in [3J, which is a first order in time and first order in space reformulation of the Einstein equations. 
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A first order harmonic system of this type was first considered in [3S], although well-posedness 
and numerical stability requires the modifications introduced by [3] , in particular constraint damp- 
ing |39| 140] . The generalized harmonic gauge was introduced in [IT]. It played a major role in the 
binary black hole evolutions of |42| 14"4"] , which used a second order harmonic formulation. See 
[4"5l US] for further applications. We give a short synopsis for the test case of a single black hole, 
which nevertheless is quite complicated since the stability tests are performed for the full Einstein 
equations in 3d. Although much of the material is contained or at least implicit in [4], it should 
be helpful to readers not familiar with numerical relativity to spell out some of the details. The 
notation is adapted to the 3+1 decomposition of [47] . 

Appendix A.l. Einstein equations in generalized harmonic gauge 

The goal is to solve the Einstein equations of classical general relativity in vacuum, 

R ab (g,dg,d 2 g) = 0, (A.l) 

for the 4-metric g ab . The Ricci tensor R ab is constructed from first and second derivatives of the 
metric. The construction of the GHG system starts with the observation that the Ricci tensor can 
be written as 

R ab = -\g cd d c d d g ab + d {a T b) - g cd T cab T d + g cd g ef {d e g ac d f g bd - T ace T bdf ), (A.2) 
where we have introduced the Christoffel symbol of the metric and one of its contractions, 

r ca & = ^(d a g bc + d b g ac - d c g ab ), T c = g ab T cab . (A.3) 



Note that in (A.2) the second derivatives of the metric are conveniently separated into g cd d c d d g ab 
and d( a T b y The first term represents a standard wave operator, while the second does not. 

We can choose harmonic coordinate functions x a for which r a = — V b V b x a = 0. In this gauge 
the principal part of the Ricci tensor consists only of the wave operator, leading to a symmetric 
hyperbolic system. Generalized harmonic coordinates satisfy r a = — H a for some given gauge 
source functions H a that may depend on the coordinates and the metric, but not on the derivative 
of the metric. Since H a does not contribute to the principal part, we again arrive at a second 
order symmetric hyperbolic system. The GHG is based on a modified Einstein equation, where the 
coordinates are incorporated through the constraint function C a = H a + T a , see in particular [3§] 
on the Z4 system. This suggests a constraint damping scheme which is essential for the stability 
of the GHG system 00]. 

In order to discuss the GHG system as a Cauchy problem, we assume that the coordinates 
naturally split into time and space, x a = (t,Xi). The spacetime normal to the hypersurfaces of 
constant time t is given by n a = —aV a t, with the lapse function a chosen such that n a n a = — 1. 
The time- flow vector field t a = (1,0*) is given by t a = an a + f3 a , where the shift vector f3 a is 
tangential to the hypersurface, n a /3 a = 0. We have 

n a = (-a,0i), n a = g ab n b = (-,-^). (A.4) 

a a 

The 4-metric g ab induces a 3-metric 7™ in the hypersurface, and determines lapse and shift, 



Hj=9ij, Pj = 9tj, P l = l ij Pj, a = y/ftp - g n - (A.5) 

The inverse 4-metric is denoted by g ab , and the inverse 3-metric is denoted by 7^. Notice that 
9ij = 7«) but g lJ = 7 lJ — /3 l /3^ /a 2 . Raising and lowering indices for 4d indices is done with the 
4-metric, and for 3d indices the 3-metric is used. 
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A first order version of the GHG system can be obtained straightforwardly by introducing new 
variables for the first derivatives of the metric, 



di, 



iab 



dig, 



k 



ab 



~{d t g a b 

a 



P%9, 



ab) 



(A.6) 



where the time derivative is in direction of the hypersurface normal, k ab = —n c d c g ab . The resulting 
first order system was first discussed in [38] . The modifications of [I] for stability involve constants 
7o> 7l> 72, and 73. Choose 73 = 7172 to obtain symmetric hyperbolicity for all 71 and 72. Choose 
71 = — 1 to avoid certain shocks. Less clear is the choice of 70, which controls the Gundlach-type 
constraint damping involving the gauge constraint C a = T a + H a , and the choice of 72, which 
appears as a coefficient of the constraint Ci a b = Q%9ab — di a b due to the introduction of first order 
variables. We choose 70 = 1 and 72 = 1, which is reported to lead to stable evolutions in standard 
numerical experiments 

The GHG system in first order form including the modifications for stability takes the form 



dtu 1 " = -A k » v d k u v + S^, 



(A.7) 



where u M = {g ab , k ab , d{ ab } is the vector of variables, and where A k ^ v and depend on u' 1 but not 
its derivatives. Written out explicitly, 



d t g a b 

dtdiab 

d t k a b 



-ak ab + (3 l d iab , 

/3 k d k d iab - adik ab + adig ab + -an c n d d icd k ab + aj jk n c d ijc d kab - ad iab , 
(3 k d k k ab - af k d k d iab - /3 k d k g ab + 2ag cd (<y ij d iac d jbd - k ac k bd - g ef T ace T bdf , 
-2aV (a i? fe ) - -cm c n d k cd k ab - an c k ci y j d jab 
+a{25 c (a n b) - g ab n c ){H c + T c ) + P% ab . 



(A.8) 
(A.9) 



(A.10) 



These equations assume that H a is given, for example through an additional evolution equation. 
The complete system involves a state vector = {g ab , k ab , di ab , H a } or even = {g ab ,k ab ,di ab , 
H a ,dtH a }, if the evolution of H a is specified by an equation that is second order in time. Since 
g ab and the other variables are symmetric in a and b, there are 50, 54, or 58 variables in u^, 
respectively. 

Since constraint conservation is non-trivial, once a set of independent variables has been chosen 
we have to strictly distinguish between dependent and independent variables. For example, the 
variable di ab and the first spatial derivative of the variable g ab are treated separately, since they 
are only equal if the corresponding constraint Ci ab = is satisfied. We collect the relations needed 



to compute the dependent quantities appearing in ( A.8 )-( A. 10 ) from the u^. Quantities obtained 
from the 3+1 split of g ab are n a , a, and 7^, see (A.4)-(A.5). The inverse metrics g ab and 7*-' 



arc 



computed as inverses from the component matrices. The "covariant" derivative of H a is defined 
as V a H b = d a H b — g cd T cab H d . The Christoffel symbols T abc and T a are computed fron 
the following expressions for d c g ab in terms of undifferentiated dynamical variables, 



dig, 



ab 



d 



iab j 



dtg a b = -ak ab + (3 l d iab , 



(A.11) 



compare (A.6) and (A.8). All other partial derivatives appearing in (A.8)-(A.10) including d a H[ 



are computed directly (e.g. numerically) as derivatives of the as required for the formulation 



(A.7) 



Appendix A. 2. Boundary conditions 

The boundary conditions imposed on the evolution system play a crucial role in achieving 
well-posedness and numerical stability. The topic has received a lot of attention, and there is a 
number of usually rather complicated boundary conditions. These are often more complicated 
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than the evolution equations themselves since e.g. for constraint conservation at the boundary the 
time evolution of the constraints is needed, which requires higher than second order derivatives 
of the metric. For the investigation of boundary conditions, we summarize the characteristic 
eigenvalue problem of the GHG system following [U [15] . Consider a normalized spatial vector s 1 
with SiS 1 = r yijS l s :) = 1. When considering 2d boundaries within constant time hypersurfaces, the 
vector s 1 is the outward pointing unit normal to the boundary. The eigenvalue problem associated 
with (A. 7) in direction s l is 



(A.12) 

where the characteristic matrix is s k A kfJ- u , the left eigenvectors are denoted by e a ^, and the eigen- 
values by vr&\. The index a labels eigenvalues and eigenvectors (and is not summed over on 
the right- hand-side) . e a M depends on s\ which typically depends on space and time due to the 
normalization with respect to 7$,-. 

We derive some explicit expressions for the first order GHG system (A.l 
suppress the ten components in the symmetric tensor indices. With b = s k j3 k . 



(A. 10), where we 




Sk 








as 



(A.13) 



Mathematica finds, considering the (sA) T right eigenvector equation, transposing the result to 
obtain the left eigenvector matrix with the eigenvectors written in the rows, scaling the eigenvectors 
for convenience, and ordering the eigenvalues and eigenvectors to correspond more closely to [3], 



/ \ 



Vfri 



(a) 



-a 
-a 



V 



/ 



/ 1 

-1 1 

-1 1 



V 





s 1 
-s l 

-S3 
-S2 





s 2 
-s 2 




\ 

s 3 

-s 3 

Sl 

/ 



(A.14) 



This representation assumes that si 7^ 0. The two eigenvectors for eigenvalue —b are orthogonal 
to Sj. Alternatively, [3J write uf = P k id k for three fields obtained by orthogonal projection, all 
with eigenspeed —b. We introduce the projection onto directions tangential to the boundary and 
orthogonal to the boundary normal, P k i = 5 k — s fc Sj. 

The standard way to impose boundary conditions for symmetric hyperbolic systems is to impose 



conditions on the characteristic fields. To this end, we split the partial derivatives in (A. 7) with 
5 k = P k i + s k Si and project (A. 7) onto eigenvectors, resulting in 



-V(&)e »s 



■'d k u» - e & fl P k i A i ^d k u u + e%S"- 



(A.15) 



In other words, we obtain an advection equation in the direction of s l with characteristic speeds 
given by the eigenvalues, plus terms involving derivatives tangential to the boundary. Equation 



(A.15) allows us to specify boundary conditions that distinguishes between incoming and outgoing 



modes according to vr^ < and Vf&\ > 0, respectively. 

Here we focus on the simplest condition that is successful for the single black hole test case. For 
the case of a Schwarzschild black hole, [1] reports that freezing the incoming characteristic fields, 



for vr$) < 



(A.16) 



boundary 

gives stable evolutions. Therefore the most basic stability test does not involve the complicated 
constraint characteristics. Since we evolve the and not the characteristic fields, a boundary 



condition on the characteristic fields cannot be implemented directly. Instead, we transform (A.7) 
with e a ^, set some of the time derivatives to zero according to (A.16), and transform back with the 
inverse of e a ll . This procedure can be combined into a transformation by a single matrix E~ l ZE, 
where Z is a diagonal matrix with on the diagonal if vr&\ < and 1 on the diagonal if v^) > 0. 

29 



Appendix A. 3. Test case of a single, spherically symmetric and static black hole 

As test case we consider the Schwarzschild spacetime, which describes a single, spherically 
symmetric and static black hole [48J. We write the Schwarzschild metric in Kerr-Schild form, 



9ab = Vab + flak, f 



2M 
r 



L = (1, 



.X .; , 



(A.17) 



where r\ ao is the Minkowski metric in coordinates (t,Xi), r = (S^XiXj)^ = {x +y +z )2, and M is 
the mass of the black hole. The horizon is located at r = 2M. A specific feature of the Kerr-Schild 
form is that the vector l a is null (l a l a = 0) with respect to both r/ a & and g a0 . We have chosen to 



scale l a such that kl l 
Euclidean 3-metric. 



rfHilj = 1, so V 1 = 5 l Hj is the normalized radial vector with respect to the 



The metric (A.17) solves the Einstein equations, and all coordinate time derivatives vanish, 
dtdab = 0. We have chosen geometrical units, G = c = 1. Furthermore, we set the black hole mass 
to one, M = 1, so all quantities including length and time are dimensionless. The numerically 
experiment consists of posing initial data based on (A.17) for t = 0, and to study the numerical 
evolution of this data. 

Initial data = {g a b, k a b, di ao } at t = is computed from g a b{t, Xi), (A.17), using the definition 



of the first order variables, (A.6). We compute the spatial derivatives in (A.6) numerically from 
(A.17). In general, k a0 requires in addition the time derivative of the metric. However, for our 
example, d t g ab = 0. 

We perform the evolution in the generalized harmonic gauge, where the gauge source function 



is initialized based on the Kerr-Schild metric (A.17), which is a non-constant function of the x 
and which is left constant during the evolution, 



H a (t = 0) = -T a (t = 0), d t H a = Q. 



(A.18) 



In [3], the gauge condition for this test case is not stated explicitly, but based on [33] we assume 
that (A.18) was used, since it is equivalent to initializing H a with the condition that dta = and 
dt/3 l = 0. We note in passing that in a different formulation for the same type of Kerr-Schild black 
hole it was found that the gauge functions a and f3 % have to be allowed to evolve in order for a 
numerically stationary solution to be found [3]. In the present case, the gauge source H a is static, 
but nevertheless lapse and shift can evolve. 

We conclude with a comment on the characteristic speeds of the GHG system for the Kerr-Schild 
metric. In terms of 3+1 variables, the Kerr-Schild metric becomes 



Hj = Sij + fk{ 



J- 



a 



(1 + 7)1/2' 



/3 i = r 



f 



i + / 



(A.19) 



The outward pointing normal Sj to a boundary surface of constant r is proportional to li, but 
normalized with respect to jij. Since 7*- 7 = — j^jPP, we have Sj = h/ y/^hlj = Uy/1 + /, and 
b = Si(3 l = f '/y/l + / = fa. According to (A. 14), the characteristic speeds vr&\ assume values 0, 
dba — b = (±1 — f)a, and —b = —fa. For r — > oo, b — > and a — > 1. Asymptotically for large 
distances, the speeds are therefore and ±1. Only the mode with speed a — b can be positive for 
the given data. It vanishes at the horizon at r = 2, and a — b > for r > 2. At the horizon and 
actually for all r < 2, all speeds are < 0, so at and below the horizon there are no incoming modes 
and no extra boundary condition is required. Since the eigenvalues are linked to the time-stepping 
stability, we note that in the numerical example r m i n = 1.80, and the eigenvalues range from —1.45 
to some value less than -(-1 at the outer boundary. For r m j n — 2.00, the fastest eigenspeed is 
-V2 = -1.41, for r min = 1.50 it is -1.53. 



At the outer boundary, the modes for a = 2,3,4 with v 



-a 



b, -b, 



-b are incoming. 

These are the modes that we freeze for the simplistic boundary condition (A. 16). In this special 
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case, the combined transformation E ZE becomes 

B(d t g a b) = d t g a b, B(d t k ab ) = ^{+d t g a b + d t k ab + s k d t d ka b), (A. 20) 

B(d t d iab ) = ^Si(-d t g a b + d t k ab + s k d t d ka b), ( A - 21 ) 

where B denotes the boundary values for the right-hand-sides determined by freezing the incoming 
modes. 

Appendix B. Examples for spin- weighted spherical harmonics 

We write the spin-weighted spherical harmonics as Y7L = P," (cos 9)e tm ^. The PPl are nor- 
malized Wigner d-functions, which in this context could also be called normalized spin-weighted 
associated Legendre polynomials. The first few PPl for n >= are: 

p0 _ 1 pi _ n p2 _ n 

0,0 — 2v^F 0,0 — u M),0 — u 



p i,-l = MO) p i-l = -ty* cos 2 (f ) P 2 -i = 

A°,o = § Vl cos(0) P/,0 = \{h MO) A 2 = o 

= -ky/£<**W A 1 ,! = 4^ sin2 (I) A 2 i = o 

Pl-2 = \^Ji^\0) pi_ a = -^/J cos 3 (f) sin (|) P 2 _ 2 = i^f cos 4 ({ ) 

Pl-i = \M™ie)M0) Pl-i = -\\fl cos 2 (|) (2 cos(0) - 1) = "7f cos 3 (|) sin (|) 

A°o = | V ^(3cos(20) + 1) Pl = cos(fl) sin(0) P 2 = sin 2 (0) 

= -^yicos(0)sin(0) P^ = - i^(2ooB(e) + l)sin 2 (§) P 2 , = -^fcos (§) sin 3 (§) 

% = lS s ' m2 ^ p l* = v^ cos B) sin3 (§) ^ 2 = ^ 4 (I) 

(B.l) 

For n < 0, we have Pp m = {-l) n+m P^ m . For example, P^ 1 = -(-l) m P z +_! m - If ra = 0, we can 
avoid the computation for m < 0, but in general we need either the m < or n < computation 
in order to obtain the other terms through a simple sign flip. 

There is no orthogonality for different spin weights, i.e. (Y™ m ,, Y^) can be non-zero even though 
n' 7^ n. For example, 



(Y n \Y 1 \)= 1 -, (^UA) = ^ (nW) = l, (YA,Yl) = ^. (B.2) 

As an example for expanding a vector component, consider the unit vector in the x-direction, 
v l = 8\. One of its tetrad components is v l rrii = m x , which has the expansions 



1 , . . „ /2vr 



m x = —=(cos6cos(j) — ism(f>) = \j — {Yii ~ ^i -l) (B-3) 
V 2 V 3 

= ^8^ 3/2 (^-i + + 4V5(l5_ a - 1&) + + + •••)• (B.4) 

m x is a two term linear combination of the Y^, but an infinite series in terms of the YP (and 
also the Y^). The vector component 9 X = -^( m x + %) = cos 9 cos does not have a finite series 
representation for any specific n, since it is the linear combination of two vectors with different 
spin weight. Not only do we obtain infinite series in specific cases, but they typically converge only 
slowly because terms like cos 9 cos <f> are not continuous as functions on the sphere. However, the 
spin-weighted harmonics are defined in such a way that the specific discontinuities introduced by 



the complex tetrad vectors are exactly resolved, e.g. as in (B.3). 
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